Processing input longer than model max input token length

Okay, so this is my question:
I have a Mistral model with a 32k max input token length, and I am planning to fine-tune this model for vulnerability detection.
In my dataset, I have inputs longer than 32k tokens.
I want to know how I should feed these functions to the model.
Since the model must see the full code, I can't use truncation, and since this is vulnerability detection and classification, the semantics of the input code are important.
Has anyone encountered this problem?
What do you think is the best method for processing this part of the dataset?


You could try a preprocessing model and trim the excessive inputs to an acceptable length, possibly with summarization. Additionally, you could look into abstract syntax trees. The best approach, I think, would be a sliding context window over those excessive-length inputs, aggregating the predictions across the overlaps; see the sketch below.
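Here is a minimal sketch of the sliding-window idea using a Hugging Face fast tokenizer; the checkpoint name, window size, and stride are placeholders, not recommendations:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any Mistral-family tokenizer works the same way here.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

MAX_LEN = 32_768  # model's max input length (leave headroom for special tokens)
STRIDE = 2_048    # tokens shared between adjacent windows

def chunk_source(code: str):
    """Split one over-length source sample into overlapping token windows."""
    enc = tokenizer(
        code,
        truncation=True,
        max_length=MAX_LEN,
        stride=STRIDE,                   # overlap, so no region is seen only at a window edge
        return_overflowing_tokens=True,  # return every window, not just the first
    )
    # enc["input_ids"] is now a list of windows, each at most MAX_LEN tokens.
    return enc["input_ids"]

# Usage (hypothetical): run the classifier on each window, then aggregate,
# e.g. flag the whole sample if any window is predicted vulnerable.
# windows = chunk_source(long_function_source)
```

Each window then gets its own forward pass, and you aggregate the per-window outputs, for example by max-pooling the vulnerability score so the sample is flagged if any window is.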


So I checked the sliding window, but my models do not support it.
And regarding splitting the input into multiple chunks, I need specific guidance on what to do with the output labels.
For example, let's say I am chunking a vulnerable piece of code, but not all of that code is vulnerable. How do we decide whether a chunk that came from vulnerable code is itself vulnerable or not, even though that specific chunk does not contain the vulnerable line? This is one of my main questions.


You could annotate the dataset with where the functions are vulnerable and maybe provide some context for the function? You could also use a vulnerability CVE dataset, maybe? I'm kind of weak in cybersecurity, but if you have line-level annotations, you can propagate them to the chunks; see the sketch below.
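As a minimal sketch, assuming you have (or can derive) line-level annotations of the vulnerable span, for example the lines touched by the fixing commit in a CVE dataset, a chunk inherits the positive label only if it overlaps an annotated line. All names and chunk sizes below are illustrative:

```python
# Hypothetical label-propagation scheme: a chunk is labeled vulnerable (1)
# only if it overlaps at least one annotated vulnerable line.

def chunk_by_lines(code: str, chunk_lines: int = 400, overlap: int = 50):
    """Split source code into overlapping line-based chunks.

    Yields (start_line, end_line, text) with 1-based inclusive line numbers.
    Assumes non-empty source.
    """
    lines = code.splitlines()
    step = chunk_lines - overlap
    for start in range(0, len(lines), step):
        end = min(start + chunk_lines, len(lines))
        yield start + 1, end, "\n".join(lines[start:end])
        if end == len(lines):
            break

def label_chunks(code: str, vulnerable_lines: set[int]):
    """Build (text, label) rows; label is 1 iff the chunk contains a vulnerable line."""
    rows = []
    for start, end, text in chunk_by_lines(code):
        label = int(any(start <= ln <= end for ln in vulnerable_lines))
        rows.append({"text": text, "label": label})
    return rows

# Usage (hypothetical): the annotation says lines 812-815 are the vulnerable span.
# rows = label_chunks(source_code, vulnerable_lines={812, 813, 814, 815})
```

One caveat: this heuristic still produces negative chunks cut from vulnerable files, so whether that counts as acceptable label noise or needs filtering is a judgment call for your dataset.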
